Learning of Uncontrolled Restless Bandits with Logarithmic Strong Regret
Authors
Abstract
In this paper we consider the problem of learning the optimal dynamic policy for uncontrolled restless bandit problems. In an uncontrolled restless bandit problem, there is a finite set of arms, each of which yields a non-negative reward when played. A player sequentially selects one of the arms at each time step. The goal of the player is to maximize its undiscounted reward over a time horizon T. The reward process of each arm is a finite-state Markov chain whose transition probabilities are unknown to the player. The state transitions of each arm are independent of the player's actions, hence "uncontrolled". We propose a learning algorithm with near-logarithmic regret uniformly over time with respect to the optimal (dynamic) finite-horizon policy. We refer to this as strong regret, in contrast to the commonly studied notion of weak regret, which is measured with respect to the optimal (static) single-action policy. We also show that when an upper bound on a function of the system parameters is known, our learning algorithm achieves logarithmic regret. Our results extend the literature on optimal adaptive learning of Markov Decision Processes (MDPs) to Partially Observed Markov Decision Processes (POMDPs). Finally, we provide numerical results on a variation of our proposed learning algorithm and compare its performance and running time with those of other bandit algorithms.
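The problem setup described in the abstract can be illustrated with a small simulation. The sketch below is not the paper's learning algorithm; it only models the environment, assuming each arm is a finite-state Markov chain that transitions at every step whether or not it is played. The names `step`, `simulate`, `P_list`, `rewards`, and `policy` are hypothetical and chosen for illustration.

```python
import random

def step(state, P, rng):
    """Sample the next state of one arm from row P[state] of its transition matrix."""
    return rng.choices(range(len(P[state])), weights=P[state])[0]

def simulate(policy, P_list, rewards, T, seed=0):
    """Total undiscounted reward collected by `policy` over horizon T.

    P_list[i] is the transition matrix of arm i (hypothetical input),
    rewards[i][s] is the reward of arm i in state s, and
    policy(t, last_obs) returns the index of the arm to play at time t.
    All arms start in state 0 (an assumption made for this sketch).
    """
    rng = random.Random(seed)
    states = [0] * len(P_list)
    last_obs = [None] * len(P_list)  # (time, state) of each arm's last observation
    total = 0.0
    for t in range(T):
        arm = policy(t, last_obs)
        total += rewards[arm][states[arm]]
        last_obs[arm] = (t, states[arm])  # only the played arm's state is observed
        # "Uncontrolled": every arm transitions regardless of the action taken.
        states = [step(s, P, rng) for s, P in zip(states, P_list)]
    return total
```

A static single-action policy, the benchmark behind weak regret, is simply `lambda t, obs: k` for a fixed arm `k`. A dynamic policy, the benchmark behind strong regret, may use `last_obs` to switch arms over time.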
Similar resources
Optimal Adaptive Learning in Uncontrolled Restless Bandit Problems
In this paper we consider the problem of learning the optimal policy for uncontrolled restless bandit problems. In an uncontrolled restless bandit problem, there is a finite set of arms, each of which when pulled yields a positive reward. There is a player who sequentially selects one of the arms at each time step. The goal of the player is to maximize its undiscounted reward over a time horizo...
Learning in A Changing World: Non-Bayesian Restless Multi-Armed Bandit
We consider the restless multi-armed bandit (RMAB) problem with unknown dynamics. In this problem, at each time, a player chooses K out of N (N > K) arms to play. The state of each arm determines the reward when the arm is played and evolves according to Markovian rules regardless of whether the arm is engaged or passive. The Markovian dynamics of the arms are unknown to the player. The objective is to ma...
Cascading Bandits: Learning to Rank in the Cascade Model
A search engine usually outputs a list of K web pages. The user examines this list, from the first web page to the last, and chooses the first attractive page. This model of user behavior is known as the cascade model. In this paper, we propose cascading bandits, a learning variant of the cascade model where the objective is to identify the K most attractive items. We formulate our problem as a sto...
One Practical Algorithm for Both Stochastic and Adversarial Bandits
We present an algorithm for multiarmed bandits that achieves almost optimal performance in both stochastic and adversarial regimes without prior knowledge about the nature of the environment. Our algorithm is based on augmentation of the EXP3 algorithm with a new control lever in the form of exploration parameters that are tailored individually for each arm. The algorithm simultaneously applies...
Generalized Thompson Sampling for Contextual Bandits
Thompson Sampling, one of the oldest heuristics for solving multi-armed bandits, has recently been shown to achieve state-of-the-art performance. This empirical success has led to great interest in the theoretical understanding of this heuristic. In this paper, we approach this problem in a way very different from existing efforts. In particular, motivated by the connection between Thompson Sam...